Chicago has been in the news a lot lately, especially due to the recent upsurge in murders resulting from gun violence. As a Chicago resident, I am aware that most of these shootings are confined to certain parts of the city, but to truly understand the nature of this violence, I decided to dig deeper into the data available from the City of Chicago.
I performed some data exploration and visualization in this notebook with the hope of uncovering some interesting insights along the way. I used publicly available data to explore crime in Chicago from January 2001 to February 2018. The data is available from the City of Chicago's data portal linked below.
Data Sources: .
This notebook is intended for anyone with at least a slight bend toward the technical side of data exploration and analysis, so I describe some of the technical steps I took to get each result. I keep this to a minimum, but I realize that most people looking at this notebook have a background in data science and would like to know how I got there. Either way, it should be easy to skip past any technical explanations or code cells.
The notebook workflow is as follows:
# import modules
import os
import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
from scipy import stats
import folium
from folium import plugins
from folium.plugins import MarkerCluster, FastMarkerCluster, HeatMapWithTime

%matplotlib inline
sns.set_style("darkgrid")
# read the CSV with a TextFileReader iterable in chunks of 100,000 rows
tp = pd.read_csv('Crimes_-_2001_to_present.csv', iterator=True, chunksize=100000)
crime_data = pd.concat(tp, ignore_index=True)
# print data shape
crime_data.shape
# view the basic stats on columns
crime_data.info()
# show first five rows of the dataframe
crime_data.head()
# show dataframe columns
crime_data.columns
# print all crime variables in the "Primary Type" column
crimes = crime_data['Primary Type'].sort_values().unique()
crimes, len(crimes)
# plot an image of Chicago police districts
# Image Credit: https://academic.oup.com/aler/article/16/1/117/135028
plt.figure(figsize=(10,18))
img = mpimg.imread('chicago_commuity _areas_map.png')
plt.imshow(img)
# print details of columns that group the data into geographic regions
print('Ward column has {} null values'.format(crime_data['Ward'].isnull().sum()))
print('Community Area column has {} null values'.format(crime_data['Community Area'].isnull().sum()))
print('District column has {} null values'.format(crime_data['District'].isnull().sum()))
# scatter plot of X and Y coordinates for all crime data available in the dataset
crime_data = crime_data.loc[crime_data['X Coordinate'] != 0]
sns.lmplot('X Coordinate',
           'Y Coordinate',
           data=crime_data,
           fit_reg=False,
           hue="District",
           palette='Dark2',
           size=12,
           scatter_kws={"marker": "D", "s": 10})
ax = plt.gca()
ax.set_title("All Crime Distribution per District")
# create and preview dataframe containing crimes associated with gang violence
col2 = ['Date','Primary Type','Arrest','Domestic','District','X Coordinate','Y Coordinate']
multiple_crimes = crime_data[col2]
multiple_crimes = multiple_crimes[multiple_crimes['Primary Type']\
.isin(['HOMICIDE','CONCEALED CARRY LICENSE VIOLATION','NARCOTICS','WEAPONS VIOLATION'])]
# clean some rogue (0,0) coordinates
multiple_crimes = multiple_crimes[multiple_crimes['X Coordinate']!=0]
multiple_crimes.head()
# visualize each geo distribution scatter plot for each of the crimes in the group
g = sns.lmplot(x="X Coordinate",
               y="Y Coordinate",
               col="Primary Type",
               data=multiple_crimes.dropna(),
               col_wrap=2, size=6, fit_reg=False,
               sharey=False,
               scatter_kws={"marker": "D", "s": 10})
After a few basic observations, it was time to retrieve a dataframe with the homicide data and do some cleaning before exploring it in depth. Some of the steps to produce the final homicide dataframe could have been performed simultaneously, but I demonstrate the first few in a bit more detail.
# create a dataframe with Homicide as the only crime
df_homicideN = crime_data[crime_data['Primary Type']=='HOMICIDE']
df_homicideN.head()
# print some attributes of our new homicide dataframe
df_homicideN.info()
# find null values in our dataframe
df_homicideN.isnull().sum()
# shape of the subset of rows containing one or more null values
df_homicideN[df_homicideN.isnull().any(axis=1)].shape
From the cells above we can observe that the dataframe contains a total of 463 rows with one or more null values.
Next, I dropped all rows containing null values so that the subsequent data transformations could be applied to a complete dataset.
# drop null values and confirm
df_homicide = df_homicideN.dropna()
df_homicide.isnull().sum().sum()
# create a list of columns to keep and update the dataframe with new columns
keep_cols = ['Year','Date','Primary Type','Arrest','Domestic','District','Location Description',
'FBI Code','X Coordinate','Y Coordinate','Latitude','Longitude','Location']
df_homicide = df_homicide[keep_cols].reset_index()
df_homicide.head()
# change string Date to datetime.datetime format
df_homicide['Date'] = df_homicide['Date'].apply(lambda x: datetime.datetime.strptime(x,"%m/%d/%Y %I:%M:%S %p"))
df_homicide.head()
# create new columns from date column -- Year, Month, Day, Hour, Minutes, DayOfWeek
df_homicide['Year'] = df_homicide['Date'].dt.year
df_homicide['Month'] = df_homicide['Date'].dt.month
df_homicide['Day'] = df_homicide['Date'].dt.day
df_homicide['Weekday'] = df_homicide['Date'].dt.dayofweek
df_homicide['HourOfDay'] = df_homicide['Date'].dt.hour
df_homicide = df_homicide.sort_values('Date')
df_homicide.head()
# print columns list and info
df_homicide.info()
So far I had created a leaner pandas dataframe with just the relevant data from the original crime dataframe. In the next section we will get our hands dirty exploring and visualizing the new homicide dataframe. To make it possible to pick up from where we left off, I stored the new dataframe in pickle form, Python's way of saving an object to a file for later retrieval. Pickling accomplishes this by writing the object as one long stream of bytes.
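The byte-stream idea can be seen with the standard pickle module directly; the dictionary below is purely an illustrative object, not part of the crime data:

```python
import pickle

# A hypothetical object standing in for any Python value we might save
obj = {'district': 11, 'homicides': 42}

raw = pickle.dumps(obj)        # serialize the object to one long stream of bytes
restored = pickle.loads(raw)   # rebuild an equal object from the byte stream
```

`pandas.to_pickle` and `pandas.read_pickle`, used in the next cells, wrap this same mechanism with file handling included.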
# save cleaned data to pickle file for easier loading from notebook start
df_homicide.to_pickle('df_homicide.pkl')
print('pickle size:', os.stat('df_homicide.pkl').st_size)
# load pickle file
df_homicide = pd.read_pickle('df_homicide.pkl')
df_homicide.head()
In this section I attempted to use the newly created dataframe to paint a picture of homicides across the city of Chicago.
# plot all homicides in dataset by location per District
df_homicide = df_homicide.loc[(df_homicide['X Coordinate']!=0)]
sns.lmplot('X Coordinate',
           'Y Coordinate',
           data=df_homicide,
           fit_reg=False,
           hue="District",
           palette='Dark2',
           size=12,
           scatter_kws={"marker": "D", "s": 10})
ax = plt.gca()
ax.set_title("All Homicides (2001-2018) per District")
First, I drew inspiration from the earlier scatter plots to visualize all homicides from 2001 to February 2018 by district. Above is the resulting plot.
Since each district is a different color, the big insight we can take from this chart is that districts with many markers clustered together have higher homicide counts than those without. To some degree this gives us an idea of homicide along a spatial dimension, but I also wanted to understand homicide rates along a time dimension.
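To put rough numbers behind the visual clustering, one could count homicides per district directly. This is a minimal sketch on illustrative rows; the real analysis would use the `District` column of `df_homicide`:

```python
import pandas as pd

# Illustrative rows -- stands in for df_homicide['District']
df = pd.DataFrame({'District': [11, 11, 11, 7, 7, 1]})

# Count rows per district, sorted from most to fewest homicides
per_district = df['District'].value_counts()
print(per_district.head())
```

The first index in the result is then the district with the most recorded homicides.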
Below I created some homicide vs time visualization and made notes on the observations.
# plot bar chart of homicide rates for all years
plt.figure(figsize=(12,6))
sns.barplot(x='Year',
            y='HOMICIDE',
            data=df_homicide.groupby(['Year'])['Primary Type']
                 .value_counts().unstack().reset_index(),
            color='steelblue').set_title("CHICAGO MURDER RATES: 2001 - 2018")
From the bar chart visualization above we can draw the following conclusions:
# plot homicides sorted by month
fig, ax = plt.subplots(figsize=(14,6))
month_nms = ['January','February','March','April','May','June','July','August',
             'September','October','November','December']
fig = sns.barplot(x='Month',
                  y='HOMICIDE',
                  data=df_homicide.groupby(['Year','Month'])['Primary Type']
                       .value_counts().unstack().reset_index(),
                  color='#808080')
ax.set_xticklabels(month_nms)
plt.title("CHICAGO MURDER RATES by MONTH -- All Years")
# -------------------------------------------
# plot average monthly temps in Chicago
# source of data: ncdc.noaa.gov
mntemp = [26.5,31,39.5,49.5,59.5,70,76,75.5,68,56,44,32]
df_temps = pd.DataFrame(list(zip(month_nms,mntemp)),
columns=['Month','AVERAGE TEMPERATURE'])
fig, ax = plt.subplots(figsize=(14,6))
sns.barplot(x='Month', y='AVERAGE TEMPERATURE', data=df_temps,color='steelblue')
# plot homicide rates vs. day of the week
fig, ax = plt.subplots(figsize=(14,6))
week_days = ['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday']
fig = sns.barplot(x='Weekday',
                  y='HOMICIDE',
                  data=df_homicide.groupby(['Year','Weekday'])['Primary Type']
                       .value_counts().unstack().reset_index(),
                  color='steelblue')
ax.set_xticklabels(week_days)
plt.title('HOMICIDE BY DAY OF THE WEEK -- All Years')
Next, we dig deeper for a more granular look, moving from the monthly to the weekly timeframe.
# use seaborn barplot to plot homicides vs. hour of the day
fig, ax = plt.subplots(figsize=(14,6))
fig = sns.barplot(x='HourOfDay',
                  y='HOMICIDE',
                  data=df_homicide.groupby(['Year','HourOfDay'])['Primary Type']
                       .value_counts().unstack().reset_index(),
                  color='steelblue',
                  alpha=.75)
plt.title('HOMICIDE BY HOUR OF THE DAY -- All Years')
Continuing with the timeframe analysis, it made sense to look at homicide rates in Chicago by time of day. It is easy to assume that murders occur more at night than during the day, but at best that is a guess, so it was wise to look at the numbers no matter how obvious we think the conclusion is.
One way to look at the above plot is from a normal work day or school day perspective.
# plot domestic variable vs. homicide variable
fig, ax = plt.subplots(figsize=(14,6))
df_arrest = df_homicide[['Year','Domestic']]
ax = sns.countplot(x="Year",
                   hue='Domestic',
                   data=df_arrest,
                   palette="Blues_d")
plt.title('HOMICIDE - DOMESTIC STATS BY YEAR')
# visualize the "scene of the crime" vs. number of occurences at such scene
crime_scene = (df_homicide['Primary Type']
               .groupby(df_homicide['Location Description'])
               .value_counts()
               .unstack()
               .sort_values('HOMICIDE', ascending=False)
               .reset_index())
# Top Homicide Crime Scene Locations
crime_scene.head(10)
# create a count plot for all crime scene locations
g = sns.factorplot(x='Location Description',
y='HOMICIDE',
data=crime_scene,
kind='bar',
size=10,
color='steelblue',
saturation=10)
g.fig.set_size_inches(15,5)
g.set_xticklabels(rotation=90)
plt.title('CRIME SCENE BY LOCATION FREQUENCY')
# create a heatmap showing crime by district by year
corr = df_homicide.groupby(['District','Year']).count().Date.unstack()
fig, ax = plt.subplots(figsize=(15,13))
sns.set(font_scale=1.0)
sns.heatmap(corr.dropna(axis=1),
annot=True,
linewidths=0.2,
cmap='Blues',
robust=True,
cbar_kws={'label': 'HOMICIDES'})
plt.title('HOMICIDE vs DISTRICT vs YEAR')
Above I created a heatmap comparing all Chicago Police Districts vs. the number of Homicides per district from 2001 to 2017.
with sns.plotting_context('notebook', font_scale=1.5):
    sorted_homicides = (df_homicide[df_homicide['Year'] >= 2016]
                        .groupby(['District']).count()
                        .Arrest.reset_index()
                        .sort_values('Arrest', ascending=False))
    fig, ax = plt.subplots(figsize=(14,6))
    sns.barplot(x='District',
                y='Arrest',
                data=sorted_homicides,
                color='steelblue',
                order=list(sorted_homicides['District']),
                label='big')
    plt.title('HOMICIDES PER DISTRICT (2016-2017) - Highest to Lowest')
As we observed from the heat map earlier, some districts are more dangerous than others and now we can explore this observation more closely. From the various visualizations we have created, we can at least conclude that 2016 and 2017 have been the most active years for homicides in Chicago.
In the barplot above I visualize the most and least murders per district in this two-year period to see their relationship.
Next I looked at the relationship between arrests and homicides in the plots below.
# create seaborn countplots for whole dataset
fig, ax = plt.subplots(figsize=(14,6))
df_arrest = df_homicide[['Year','Arrest']]
ax = sns.countplot(x="Year",
                   hue='Arrest',
                   data=df_arrest,
                   palette="PuBuGn_d")
plt.title('HOMICIDE - ARRESTS STATS BY YEAR')
The chart above compares arrest vs. non-arrests for all homicides in the dataset for all 17 years.
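The same arrest vs. non-arrest comparison can be tabulated rather than plotted, for example with `pd.crosstab`. A minimal sketch on made-up rows standing in for `df_homicide[['Year', 'Arrest']]`:

```python
import pandas as pd

# Made-up rows standing in for df_homicide[['Year', 'Arrest']]
df = pd.DataFrame({'Year':   [2016, 2016, 2016, 2017, 2017],
                   'Arrest': [True, False, False, True, True]})

# Arrests vs. non-arrests per year, as raw counts
counts = pd.crosstab(df['Year'], df['Arrest'])
print(counts)
```

Each row of the table mirrors one cluster of bars in the countplot above.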
# create seaborn countplots for 2016 and 2017 -- high crime rate spike years
fig, ax = plt.subplots(figsize=(14,6))
ax = sns.countplot(x="Month",
                   hue='Arrest',
                   data=df_homicide[df_homicide['Year'] >= 2016][['Month','Arrest']],
                   palette="PuBuGn_d")
month_nms = ['January','February','March','April','May','June','July',
             'August','September','October','November','December']
ax.set_xticklabels(month_nms)
plt.title('HOMICIDE - ARRESTS STATS BY MONTH -- (2016-2018)')
# create seaborn lmplot to compare arrest rates for different districts
dfx = (df_homicide[df_homicide['District']
       .isin(list(sorted_homicides.head(10)['District']))]
       .groupby(['District','Year','Month','Arrest'])['Primary Type']
       .value_counts().unstack().reset_index())
with sns.plotting_context('notebook', font_scale=1.25):
    sns.set_context("notebook", font_scale=1.15)
    g = sns.lmplot('Year', 'HOMICIDE',
                   col='District',
                   col_wrap=5,
                   size=5,
                   aspect=0.5,
                   sharex=False,
                   data=dfx,
                   fit_reg=True,
                   hue="Arrest",
                   palette=sns.color_palette("seismic_r", 2),
                   scatter_kws={"marker": "o", "s": 7},
                   line_kws={"lw": 0.7})
A further breakdown of arrests vs. non-arrests at the district level, for just the high-homicide districts, added more weight to my previous observation of a decreasing number of arrests over the years in the dataset. The plots above show regression lines for both arrests and non-arrests on the same axes for the top 10 most dangerous districts.
The trend of police making fewer arrests than the previous year had already begun before 2006, and by 2010 the tables had flipped in all districts, with arrests made in less than half of all homicides.
It may well be that the odds of being arrested for a homicide are less than one in two, and the odds of avoiding arrest look even better for a criminal living in a high-crime district.
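The "less than half" claim can be checked numerically: because `Arrest` is boolean, its yearly mean is exactly the arrest rate. A sketch with illustrative values rather than the real `df_homicide`:

```python
import pandas as pd

# Illustrative values -- the real check would use df_homicide[['Year', 'Arrest']]
df = pd.DataFrame({'Year':   [2005, 2005, 2005, 2012, 2012, 2012],
                   'Arrest': [True, True, False, True, False, False]})

# Mean of a boolean column = fraction of homicides that led to an arrest
arrest_rate = df.groupby('Year')['Arrest'].mean()

# Years where arrests were made in under half of all homicides
below_half = arrest_rate[arrest_rate < 0.5]
```

On the real data, every year from roughly 2010 onward would be expected to appear in `below_half` if the observation above holds.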
Next I used Folium, a Python map-visualization library, to create some maps illustrating my findings, with notes below. The maps are interactive, so you can zoom in and out and pan around.
# plot choropleth maps for all full years in the dataset
def toString(x):
    return str(int(x))

df_homicide_allyears = df_homicide.groupby(['District']).count().Arrest.reset_index()
df_homicide_allyears['District'] = df_homicide_allyears['District'].apply(toString)
# ______________________________________________________#
chicago = [41.85, -87.68]
m = folium.Map(chicago,
zoom_start=10)
plugins.Fullscreen(
position='topright',
title='Expand me',
title_cancel='Exit me',
force_separate_button=True).add_to(m)
m.choropleth(
geo_data='chicago_police_districts.geojson',
name='choropleth',
data=df_homicide_allyears,
columns=['District', 'Arrest'],
key_on='feature.properties.dist_num',
fill_color='YlOrRd',
fill_opacity=0.4,
line_opacity=0.2,
legend_name='Choropleth of Homicide per Police District : 2001-2017',
highlight=True
)
folium.TileLayer('openstreetmap').add_to(m)
folium.TileLayer('cartodbpositron').add_to(m)
folium.LayerControl().add_to(m)
m
# plot 2016-2018 choropleth map
def toString(x):
    return str(int(x))
df_homicide_after_2015 = df_homicide[df_homicide['Year']>=2016].groupby(['District']).count().Arrest.reset_index()
df_homicide_after_2015['District'] = df_homicide_after_2015['District'].apply(toString)
# ______________________________________________________#
chicago = [41.85, -87.68]
m = folium.Map(chicago,
zoom_start=10)
plugins.Fullscreen(
position='topright',
title='Expand me',
title_cancel='Exit me',
force_separate_button=True).add_to(m)
m.choropleth(
geo_data='chicago_police_districts.geojson',
name='choropleth',
data=df_homicide_after_2015,
columns=['District', 'Arrest'],
key_on='feature.properties.dist_num',
fill_color='YlOrRd',
fill_opacity=0.4,
line_opacity=0.2,
legend_name='Homicide per Police District : 2016-2017',
highlight=True
)
folium.TileLayer('openstreetmap').add_to(m)
folium.TileLayer('cartodbpositron').add_to(m)
folium.LayerControl().add_to(m)
m
# plot heatmap all districts -- (2016-2018)
after_2015_geo = []
for index, row in df_homicide[df_homicide['Year'] >= 2016][['Latitude','Longitude','District']].dropna().iterrows():
    after_2015_geo.append([row["Latitude"], row["Longitude"], row['District']])
# _______________________________________________________________________
chicago = [41.85, -87.68]
m = folium.Map(chicago, zoom_start=9.5,control_scale = False)
plugins.Fullscreen(
position='topright',
title='Expand me',
title_cancel='Exit me',
force_separate_button=True).add_to(m)
m.choropleth(
geo_data='chicago_police_districts.geojson',
name='choropleth',
data=df_homicide_after_2015,
columns=['District', 'Arrest'],
key_on='feature.properties.dist_num',
fill_color='YlOrRd',
fill_opacity=0.4,
line_opacity=0.2,
legend_name='HeatMap Homicides : 2016-2017',
highlight=True
)
m.add_child(plugins.HeatMap(after_2015_geo,
name='all_homicides_2016_to_2017',
radius=5,
max_zoom=1,
blur=10,
max_val=3.0))
folium.TileLayer('openstreetmap').add_to(m)
folium.TileLayer('cartodbpositron').add_to(m)
folium.LayerControl().add_to(m)
m
The heat-mapping visualizations above were created with Python's Folium library, which gives great control for plotting dynamic maps in a notebook.
The heat map covers the 2016-2017 time frame. The individual crime scene locations from this period are mapped onto the city's canvas and represented as colors.
The heatmap helps us visualize general locations on the map with a high or low frequency of crimes. Dark orange represents areas where crime is high, and the disconnected light yellow/green shades show locations where crime is low.
The immediate impression is that homicides are higher on the west side and south side of Chicago. We can zoom in for more detail per area; with every zoom, more clusters are revealed within that subregion.
Note: the time-lapse heatmap below has added user controls.
- There is a control bar at the bottom of the map with Play, Pause, Loop, a scroll bar, etc.
- The date is displayed on the bar as the last two digits of the corresponding year.
# plot yearly time lapse heatmap all districts -- (2001-2017)
chicago = [41.85, -87.68]
m = folium.Map(chicago, zoom_start=9.5,control_scale = False)
plugins.Fullscreen(
position='topright',
title='Expand me',
title_cancel='Exit me',
force_separate_button=True).add_to(m)
m.choropleth(
geo_data='chicago_police_districts.geojson',
name='choropleth',
data=df_homicide_allyears,
columns=['District', 'Arrest'],
key_on='feature.properties.dist_num',
fill_color='YlOrRd',
fill_opacity=0.2,
line_opacity=0.2,
legend_name='Homicides : 2001-2017',
highlight=True
)
heat_df = df_homicide[df_homicide['Year']>=2001].reset_index()
heat_df = heat_df[['Latitude', 'Longitude','Year']]
heat_df['Weight'] = heat_df['Year'].astype(float)
heat_df = heat_df.dropna(axis=0, subset=['Latitude','Longitude', 'Weight'])
heat_data = [[[row['Latitude'], row['Longitude']]
              for index, row in heat_df[heat_df.Weight == i].iterrows()]
             for i in range(2001, 2018)]
m.add_child(plugins.HeatMapWithTime(data=heat_data,
auto_play=True,
max_opacity=0.8,
display_index=True,
radius=9,
name='HeatMapWithTime')
)
folium.TileLayer('cartodbpositron').add_to(m)
folium.LayerControl().add_to(m)
m
# geo locations of homicides crime scenes -- 2016-2017
df_homicide_after_2015 = df_homicide[df_homicide['Year']>=2016].groupby(['District']).count().Arrest.reset_index()
df_homicide_after_2015['District'] = df_homicide_after_2015['District'].apply(toString)
after_2015 = df_homicide[df_homicide['Year']>=2016].dropna()
# _____________________________________________
lats = list(after_2015.Latitude)
longs = list(after_2015.Longitude)
locations = [lats,longs]
m = folium.Map(
location=[np.mean(lats), np.mean(longs)],
zoom_start=10.3
)
plugins.Fullscreen(
position='topright',
title='Expand me',
title_cancel='Exit me',
force_separate_button=True).add_to(m)
FastMarkerCluster(data=list(zip(lats, longs))).add_to(m)
m.choropleth(
geo_data='chicago_police_districts.geojson',
name='choropleth',
data=df_homicide_after_2015,
columns=['District', 'Arrest'],
key_on='feature.properties.dist_num',
fill_color='YlOrRd',
fill_opacity=0.4,
line_opacity=0.2,
legend_name='Homicides : 2016-2017',
highlight=False
)
# folium.TileLayer('openstreetmap').add_to(m)
folium.TileLayer('cartodbpositron').add_to(m)
folium.LayerControl().add_to(m)
m
# geo locations of homicides -- January, February 2018
df_homicide_2018 = df_homicide[df_homicide['Year']==2018].groupby(['District']).count().Arrest.reset_index()
df_homicide_2018['District'] = df_homicide_2018['District'].apply(toString)
only_2018 = df_homicide[df_homicide['Year']==2018].dropna()
# _____________________________________________
lats = list(only_2018.Latitude)
longs = list(only_2018.Longitude)
locations = [lats,longs]
m = folium.Map(
location=[np.mean(lats), np.mean(longs)],
zoom_start=10.3
)
plugins.Fullscreen(
position='topright',
title='Expand me',
title_cancel='Exit me',
force_separate_button=True).add_to(m)
FastMarkerCluster(data=list(zip(lats, longs))).add_to(m)
m.choropleth(
geo_data='chicago_police_districts.geojson',
name='choropleth',
data=df_homicide_2018,
columns=['District', 'Arrest'],
key_on='feature.properties.dist_num',
fill_color='YlOrRd',
fill_opacity=0.4,
line_opacity=0.2,
legend_name='Homicides : January, February 2018',
highlight=False
)
# folium.TileLayer('openstreetmap').add_to(m)
folium.TileLayer('cartodbpositron').add_to(m)
folium.LayerControl().add_to(m)
m
Finally, I created two visualizations mapping individual crime scenes, which is more or less like the heat mapping without the color added to the geo-locations. One map covers the 2016-2017 period, and the other covers the two months of data available for 2018 (January and February).
The January-February map is particularly interesting because it shows where crime is already starting to appear in larger numbers at the beginning of 2018. Sadly, the homicides follow the same pattern as previous years, with an overwhelming majority of the crimes happening in the same high-homicide districts.
The map would be a great tool for anyone in the right field trying to identify which problem spots need more resources. It can also be a great tool for people in the community to visually inspect individual crime locations; combined with their knowledge of the area, this could surface new insights that can be explored and acted upon to make the community safer. For example, community members could identify problem spots and establish a safe-passage route for children to and from school, using the map to pick out blocks that are less dangerous.
Throughout this notebook, I used data compiled by the Chicago Police Department to extract some insights on homicides in Chicago. While the analysis performed here lacks a national perspective, my key findings can be distilled to a few key points below:
While these observations are illuminating, I should mention that they don't paint the whole picture, and comparisons should be made with national or other comparable data to put them in perspective. For example, while the number of homicides per year seems high, we don't know whether it is within the national average or far above it. Such comparisons would help us build more compelling arguments in our data storytelling by using them to support our local findings.
That being said, our findings are still very relevant to local decision makers because they are very clear on the where and the when, if not the why.